Below are 100 randomly selected rows from the dataset.
The table below reports several summary metrics for each column/variable: the number of unique values, the number of NAs, the maximum, the minimum, and the mean.
A quick look at counts of the levels of the categorical variables in the data is shown in the following set of bar charts.

Trelliscopejs is an R package that lets one quickly sort and filter through many slices of previously generated plots of the data. For example, we may want to look through all levels of the variable dna_visittrafficsubtype (not shown here). To do this we simply create a generic plot, apply one slice/level/subset (subtype) at a time to the plot, and compute any features that might help us learn about the effects of dna_visittrafficsubtype on the data. We can then sort and filter through the feature set to find anomalies or interesting behaviors that might influence modeling or help us understand the relationships between variables in the data.
For this dataset the number of categorical variables is small, so the power of trelliscope may appear limited. However, when one needs to quickly move through hundreds, thousands, or even millions of slices of the data, trelliscope provides a seamless interface for managing that many subsets and discovering interesting features.
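As a sketch of the workflow described above (assuming a hypothetical data frame `df` holding this dataset; trelliscopejs's `facet_trelliscope()` replaces `facet_wrap()` so each subtype becomes its own sortable, filterable panel):

```r
library(ggplot2)
library(trelliscopejs)

# Generic plot, sliced one dna_visittrafficsubtype per panel.
# `df` is an assumed stand-in for the dataset described in this report.
ggplot(df, aes(x = xsell_day_exact)) +
  geom_histogram(bins = 30) +
  facet_trelliscope(~ dna_visittrafficsubtype,
                    nrow = 2, ncol = 4,
                    self_contained = TRUE)
```

Trelliscope automatically attaches summary statistics (cognostics) to each panel, which is what makes sorting and filtering across many slices practical.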
The only numeric variable, xsell_day_exact, is highly skewed, so a log transformation better shows the distribution of values.
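A minimal base-R sketch of that transformation (synthetic skewed values stand in for xsell_day_exact; `log10(x + 1)` is used so zero values are kept rather than silently dropped):

```r
set.seed(42)

# Synthetic stand-in for xsell_day_exact: heavily right-skewed, many zeros.
xsell_day_exact <- c(rep(0, 340), round(rexp(660, rate = 1 / 50)))

# log10(x + 1) keeps the zeros in the plot instead of dropping them.
log_days <- log10(xsell_day_exact + 1)

hist(log_days,
     breaks = 30,
     main = "Distribution of log10(xsell_day_exact + 1)",
     xlab  = "log10(days + 1)")
```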
As noted in the section below, outliers and NAs make it difficult to initially see the trends over time of these two time series related variables.
- 33,407 prospectids appear multiple times, accounting for 85,349 rows in the dataset.
- dna_visittrafficsubtype has roughly 15 non-NA subtypes with over 2,000 observations, meaning many subtypes contain very few (for some, almost no) observations in this dataset. May prefer to combine subtypes later.
- Treat xsell_gsa as categorical.
- xsell_day_exact appears to have a funky distribution, quite log looking. However, the histogram ignores zero values, which make up roughly 33.8% of that variable. May want to look at a cumulative view of some kind.
- ordercreatedate has a date of 1900-01-01 repeated 12 times in the dataset. (Suggest removing these from the dataset.)
- dnatestactivationdayid has 29% of its values as NA.
- A closer look at dna_visittrafficsubtype shows that many of the subtypes are rarely found in this dataset. Grouping or combining these in a meaningful manner may help, but unfortunately I doubt I have sufficient information or experience to group the levels of this variable at this time.
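One common way to combine the rare subtypes, sketched here in base R on synthetic labels (a stand-in for dna_visittrafficsubtype; the 2,000-observation cutoff mirrors the threshold mentioned above and is otherwise an assumption):

```r
# Collapse levels of a categorical variable whose counts fall below a
# threshold into a single "other" level (synthetic stand-in data).
subtype <- c(rep("search", 3000), rep("email", 2500),
             rep("affiliate", 40), rep("display", 7))

lump_rare <- function(x, min_n = 2000, other = "other") {
  counts <- table(x)
  rare   <- names(counts)[counts < min_n]
  ifelse(x %in% rare, other, x)
}

lumped <- lump_rare(subtype)
table(lumped)
# "search" and "email" survive; "affiliate" and "display" collapse into "other"
```

This purely count-based lumping avoids needing domain knowledge about the subtypes, at the cost of mixing potentially unrelated traffic sources into one level.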
I must not understand the regtenure column yet. I had assumed that regtenure was simply a binned, categorical view of xsell_day_exact, but from the histograms below I must be missing something: the two variables do not appear to be linked in any way.
After removing the outlier dates (noted above) for ordercreatedate we can better see the general trend. To really see that trend, though, one would need to zoom in to the main group of data, excluding the high points from last year's sale around Thanksgiving. In fact, after zooming in to see the generally positive trend, one notices several days with very low observation counts that warrant further consideration (March 21-30, 2018).
After removing the NAs from dnatestactivationdayid we can better see the general trend.
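Both cleaning steps can be sketched together (hypothetical data frame `orders` with the two columns discussed here; the 1900-01-01 sentinel dates and the NA activation ids are the issues noted above):

```r
# Synthetic stand-in for the dataset's two time-related columns.
orders <- data.frame(
  ordercreatedate        = as.Date(c("1900-01-01", "2017-11-24",
                                     "2018-03-25", "2018-04-02")),
  dnatestactivationdayid = c(17000, NA, 17250, 17300)
)

# Drop the 1900-01-01 sentinel dates and the NA activation ids.
clean <- orders[orders$ordercreatedate != as.Date("1900-01-01") &
                !is.na(orders$dnatestactivationdayid), ]

nrow(clean)  # rows surviving both filters
```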
Variance appears to tighten up in 2016-2017, and the obvious drop from late 2016 into 2017 will cause problems for most models. Forecasting or predicting from the full dataset could prove difficult if the model cannot account for the sudden drop, and most models cannot.

The drop is also temporary in nature, meaning it will always exist at the end of the time series, because cross-sell conversion requires time to mature to its real number. Recently purchased orders (say, from the last week, at the end of the time series) will show a very low percentage of cross-sells, because most cross-sells take more time to occur. A few weeks or months later, that cross-sell percentage will have increased dramatically.

In short, modeling the cross-sell percentage will always have this limitation (the recent drop). One would have to overcome it by weighting past observations higher than the most recent (very tricky), or, more likely, one would simply model only the "mature" portion of the cross-sell percentage. I can think of several ways to use the drop information in modeling, but that would be a very academic exercise and would not necessarily improve the overall prediction; it may, however, speed up reaction to any changes in cross-sell conversion.
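A minimal sketch of the "model only the mature portion" idea (synthetic orders; the 90-day maturity window is an assumption for illustration, not something estimated from this dataset):

```r
set.seed(7)

# Synthetic orders: a date plus a 0/1 cross-sell flag.
orders <- data.frame(
  orderdate = as.Date("2017-01-01") + sample(0:500, 1000, replace = TRUE),
  xsell     = rbinom(1000, 1, 0.2)
)

# Assumed maturity window: conversions need ~90 days to fully materialize,
# so drop orders newer than that before computing the daily conversion rate.
as_of    <- max(orders$orderdate)
maturity <- 90
mature   <- orders[orders$orderdate <= as_of - maturity, ]

daily_rate <- aggregate(xsell ~ orderdate, data = mature, FUN = mean)
```

Truncating the immature tail trades away the most recent observations for rates that are no longer biased downward by conversions that simply have not happened yet.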
A more detailed view of this daily xsell conversion may help us understand what is influencing this behavior and how it might affect model construction. Here I only show it broken out by customer type group.
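That breakdown can be sketched with a simple aggregation (synthetic data; `customer_type` is a hypothetical stand-in for whatever column encodes the customer type group in the actual dataset):

```r
set.seed(1)

# Synthetic orders across ten days and two customer types.
orders <- data.frame(
  orderdate     = rep(as.Date("2018-03-01") + 0:9, each = 50),
  customer_type = sample(c("new", "returning"), 500, replace = TRUE),
  xsell         = rbinom(500, 1, 0.25)
)

# Daily cross-sell conversion rate, broken out by customer type.
daily_by_type <- aggregate(xsell ~ orderdate + customer_type,
                           data = orders, FUN = mean)
head(daily_by_type)
```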